Title of the Invention



Scalable File Server with Highly Available Pairs 



Background of the Invention 



Field of the Invention 



The invention relates to storage systems. 



Related Art 



Computer storage systems are used to record and retrieve data. One way storage systems
are characterized is by the amount of storage capacity they have. The capacity of storage
systems has increased greatly over time. One problem in the known art is the difficulty of
planning ahead for desired increases in storage capacity. A related problem in the known
art is the difficulty of providing scalable storage at a relatively efficient cost. This
has subjected customers to a dilemma: one can either purchase a file system with a single
large file server, or purchase a file system with a number of smaller file servers.

The single-server option has several drawbacks. (1) The customer must buy a larger file
system than currently desired, so as to have room available for future expansion. (2) The
entire file system can become unavailable if the file server fails for any reason. (3) The
file system, although initially larger, is not easily scalable if the customer comes to
desire a system that is larger than the originally planned capacity.

The multi-server option also has several drawbacks. In systems in which the individual
components of the multi-server device are tightly coordinated, (1) the same scalability
problem occurs for the coordinating capacity for the individual components. That is, the
customer must buy more coordinating capacity than currently desired, so as to have room
available for future expansion. (2) The individual components are themselves often
obsolete by the time the planned-for greater capacity is actually needed. (3) Tightly
coordinated systems are often very expensive relative to the amount of scalability desired.

In systems in which the individual components of the multi-server device are only loosely
coordinated, it is difficult to cause the individual components to behave in a coordinated
manner so as to emulate a single file server. Although failure of a single file server
does not cause the entire file system to become unavailable, it does cause any files
stored on that particular file server to become unavailable. If those files were critical
to operation of the system, or some subsystem thereof, the applicable system or subsystem
will be unavailable as a result. Administrative difficulties generally increase due to a
larger number of smaller file servers.

Accordingly, it would be advantageous to provide a method and system for implementing a
file server system that is scalable, that is, which can be increased in capacity without
major system alterations, and which is relatively cost efficient with regard to that
scalability. This advantage is achieved in an embodiment of the invention in which a
plurality of file server nodes (each a pair of file servers) are interconnected. Each file
server node has a pair of controllers for simultaneously controlling a set of storage
elements such as disk drives. File server commands are routed among file server nodes to
the file server node having control of the applicable storage elements, and each pair of
file servers is reliable due to redundancy.

It would also be advantageous to provide a storage system that is resistant to failures of
individual system elements, and that can continue to operate after any single point of
failure. This advantage is achieved in an embodiment of the invention like that described
in co-pending Application Serial No. 09/037,652, filed March 10, 1998, Express Mail
Mailing No. EE143637441US, in the name of the same inventor, titled "[...] Available File
Servers", attorney docket number NAP-012, hereby incorporated by reference as if fully set
forth herein.



Summary of the Invention

The invention provides a file server system and a method for operating that system, which
is easily scalable in number and type of individual components. A plurality of file server
nodes (each a pair of file servers) are coupled using inter-node connectivity, such as an
inter-node network, so that any one pair can be accessed from any other pair. Each file
server node includes a pair of file servers, each of which has a memory and each of which
conducts file server operations by simultaneously writing to its own memory and to that of
its twin, the pair being used to simultaneously control a set of storage elements such as
disk drives. File server commands or requests directed to particular mass storage elements
are routed among file server nodes using an inter-node switch and processed by the file
server nodes controlling those particular storage elements. Each file server node (that
is, each pair of file servers) is reliable due to its own redundancy.
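
The twin-memory arrangement described above can be pictured with a minimal sketch. It is
illustrative only: the class name `FileServer`, the `record` method, and the use of a list
as each server's memory are assumptions made for the example, and the two writes happen
one after the other here rather than truly simultaneously.

```python
class FileServer:
    """One member of a pair. `memory` is a hypothetical stand-in for its memory."""

    def __init__(self, name):
        self.name = name
        self.memory = []
        self.twin = None  # set by make_pair()

    def record(self, operation):
        # Write the operation to this server's own memory and to its twin's
        # memory, so both members of the pair hold the same record.
        self.memory.append(operation)
        if self.twin is not None:
            self.twin.memory.append(operation)


def make_pair(name_a, name_b):
    """Create two file servers and make each the other's twin."""
    a, b = FileServer(name_a), FileServer(name_b)
    a.twin, b.twin = b, a
    return a, b


a, b = make_pair("server-a", "server-b")
a.record("write /vol0/x")
b.record("write /vol0/y")
```

After these calls both servers hold the same two records, which is what would let either
one take over for the other.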

In a preferred embodiment, the mass storage elements are disposed and controlled to form a
redundant array, such as a RAID storage system. The inter-node network and inter-node
switch are redundant, and file server commands or requests arriving at the network of
pairs are coupled using the network and the switch to the appropriate pair and processed
at that pair. Thus, each pair can be reached from each other pair, and no single point of
failure prevents access to any individual storage element. The file servers are disposed
and controlled to recognize failures of any single element in the file server system and
to provide access to all mass storage elements despite any such failures.

Brief Description of the Drawings

Figure 1 shows a block diagram of a scalable and highly available file server system.

Figure 2A shows a block diagram of a first interconnect system for the file server system.

Figure 2B shows a block diagram of a second interconnect system for the file server system.

Figure 3 shows a process flow diagram of operation of the file server system.

Detailed Description of the Preferred Embodiment

In the following description, a preferred embodiment of the invention is described with
regard to preferred process steps and data structures. However, those skilled in the art
would recognize, after perusal of this application, that embodiments of the invention may
be implemented using one or more general purpose processors (or special purpose processors
adapted to the particular process steps and data structures) operating under program
control, and that implementation of the preferred process steps and data structures
described herein using such equipment would not require undue experimentation or further
invention.

Inventions described herein can be used in conjunction with inventions described in the
following applications:

• Application Serial No. 09/037,652, filed March 10, 1998, Express Mail Mailing No.
  EE143637441US, in the name of the same inventor, titled "Scalable and Highly Available
  File Server", attorney docket number NAP-012.

This application is hereby incorporated by reference as if fully set forth herein. It is
herein referred to as the "Availability Disclosure."






Exp. Mail EK025321 




NAP-010 



File Server System

Figure 1 shows a block diagram of a scalable and highly available file server system.

A file server system 100 includes a set of file servers 110, each including a coupled pair
of file server nodes 111 having co-coupled common sets of mass storage devices 112. Each
node 111 is like the file server node further described in the Availability Disclosure.
Each node 111 is coupled to a common interconnect 120. Each node 111 is also coupled to a
first network switch 130 and a second network switch 130.

Each node 111 is coupled to the common interconnect 120, so as to be able to transmit
information between any two file servers 110. The common interconnect 120 includes a set
of communication links (not shown) which are redundant in the sense that even if any
single communication link fails, each node 111 can still be contacted by each other node
111.
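
The redundancy property stated above (every node remains reachable after any single link
failure) can be checked mechanically for a given link layout. The sketch below is one way
to do so and is not taken from the application: the function names, the node labels, and
the example ring and chain topologies are all assumptions chosen for illustration.

```python
from collections import deque


def connected(nodes, links):
    """Return True if every node can reach every other node over `links`."""
    nodes = list(nodes)
    if not nodes:
        return True
    adjacency = {n: set() for n in nodes}
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    # Breadth-first search from an arbitrary starting node.
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return len(seen) == len(nodes)


def survives_any_single_link_failure(nodes, links):
    """True if the network is connected and stays connected with any one link removed."""
    links = list(links)
    return connected(nodes, links) and all(
        connected(nodes, links[:i] + links[i + 1:]) for i in range(len(links))
    )


nodes = ["n0", "n1", "n2", "n3"]                                  # hypothetical nodes 111
ring = [("n0", "n1"), ("n1", "n2"), ("n2", "n3"), ("n3", "n0")]  # closed loop of links
chain = ring[:-1]                                                 # same links, loop left open
```

In this example the closed ring passes the check, while the open chain does not, because
removing any one of its links isolates at least one node.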

In a preferred embodiment, the common interconnect 120 includes a NUMA (non-uniform memory
access) interconnect, such as the SCI interconnect operating at 1 gigabyte per second or
the SCI-lite interconnect operating at 125 megabytes per second.

Each file server 110 is coupled to the first network switch 130, so as to receive and
respond to file server requests transmitted therefrom. In a preferred embodiment there is
also a second network switch 130, although the second network switch 130 is not required
for operation of the file server system 100. As with the first network switch 130, each
file server 110 is coupled to the second network switch 130, so as to receive and respond
to file server requests transmitted therefrom.

File Server System Operation

In operation of the file server system 100, as further described herein, a sequence of
file server requests arrives at the first network switch 130 or, if the second network
switch 130 is present, at either the first network switch 130 or the second network switch
130. Either network switch 130 routes each file server request in its sequence to the
particular file server 110 that is associated with the particular mass storage device
needed for processing the file server request.
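
The routing decision described above amounts to a lookup from mass storage device to the
file server 110 associated with it. A minimal sketch follows, assuming a static table; the
device and file server names, the request layout, and the function `route_request` are
hypothetical, and the application does not say how the association is actually stored.

```python
# Hypothetical association of mass storage devices with file servers 110.
DEVICE_TO_FILE_SERVER = {
    "disk-a": "file-server-0",
    "disk-b": "file-server-0",
    "disk-c": "file-server-1",
}


def route_request(request, device_to_file_server=DEVICE_TO_FILE_SERVER):
    """Return the file server associated with the device named in `request`."""
    device = request["device"]
    if device not in device_to_file_server:
        raise LookupError(f"no file server is associated with device {device!r}")
    return device_to_file_server[device]


destination = route_request({"op": "read", "device": "disk-c"})
```

Because the table is the only shared state in this sketch, a second network switch 130
holding the same table would route the same request to the same file server.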

One of the two nodes 111 at the designated file server 110 services the file server
request and makes a file server response. The file server response is routed by one of the
network switches 130 back to a source of the request.




Interconnect System

Figure 2A shows a block diagram of a first interconnect system for the file server system.

In a first preferred embodiment, the interconnect 120 includes a plurality of nodes 111,
each of which is part of a file server 110. The nodes 111 are each disposed on a
communication ring 211. Messages are transmitted between adjacent nodes 111 on each ring
211.

In this first preferred embodiment, each ring 211 comprises an SCI (Scalable Coherent
Interface) network according to IEEE standard 1596-1992, or an SCI-lite network according
to IEEE standard 1394.1. Both IEEE standard 1596-1992 and IEEE standard 1394.1 support
remote memory access and DMA; the combination of these features is often called NUMA
(non-uniform memory access). SCI networks operate at a data transmission rate of about 1
gigabyte per second; SCI-lite networks operate at a data transmission rate of about 125
megabytes per second.
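
As a rough illustration of the two data rates quoted above, the following sketch computes
how long an idealized transfer would take at each rate. The payload size, the function
name `transfer_seconds`, and the use of decimal units (1 gigabyte = 10^9 bytes, 1 megabyte
= 10^6 bytes) are assumptions made for the example, and protocol overhead is ignored.

```python
SCI_BYTES_PER_SECOND = 1_000_000_000     # about 1 gigabyte per second (decimal units assumed)
SCI_LITE_BYTES_PER_SECOND = 125_000_000  # about 125 megabytes per second (decimal units assumed)


def transfer_seconds(payload_bytes, rate_bytes_per_second):
    """Idealized transfer time: payload size divided by link rate, with no overhead."""
    if rate_bytes_per_second <= 0:
        raise ValueError("rate must be positive")
    return payload_bytes / rate_bytes_per_second


payload = 500_000_000  # hypothetical 500-megabyte payload
sci_time = transfer_seconds(payload, SCI_BYTES_PER_SECOND)
sci_lite_time = transfer_seconds(payload, SCI_LITE_BYTES_PER_SECOND)
```

Under these assumptions the SCI-lite transfer takes eight times as long as the SCI
transfer, which is simply the ratio of the two quoted rates.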

A communication switch 212 couples adjacent rings 211. The communication switch 212
receives and transmits messages on each ring 211, and operates to bridge messages from a
first ring 211 to a second ring 211. The communication switch 212 bridges those messages
that are transmitted on the first ring 211 and designated for transmission to the second
ring 211. A switch 212 can also be coupled directly to a file server node 110.

In this first preferred embodiment, each ring 211 has a single node 111, so as to prevent
any single point of failure (such as failure of the ring 211 or its switch 212) from
preventing communication to more than one node 111.

Figure 2B shows a block diagram of a second interconnect system for the file server system.

In a second preferred embodiment, the interconnect 120 includes a plurality of nodes 111,
each of which is part of a file server 110. Each node 111 includes an associated network
interface element 114. In a preferred embodiment, the network interface element 114 for
each node 111 is like that described in the Availability Disclosure.

The network interface elements 114 are coupled using a plurality of communication links
221, each of which couples two network interface elements 114 and communicates messages
therebetween.

The network interface elements 114 have sufficient communication links 221 to form a
redundant communication network, so as to prevent any single point of failure (such as
failure of any one network interface element 114) from preventing communication to more
than one node 111.

In this second preferred embodiment, the network interface elements 114 are disposed with
the communication links 221 to form a logical torus, in which each network interface
element 114 is disposed on two logically orthogonal communication rings using the
communication links 221.
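
One way to picture the logical torus described above is as a grid whose rows and columns
each wrap around into a ring. The sketch below assumes such a rows-by-columns layout and
lists, for one network interface element, its neighbors on its two orthogonal rings. The
grid coordinates, the function name `torus_neighbors`, and the dimensions used are
assumptions for illustration and do not appear in the application.

```python
def torus_neighbors(row, col, rows, cols):
    """Return the ring neighbors of element (row, col) in a rows x cols logical torus.

    Each element sits on one "row ring" and one "column ring"; the modulo
    arithmetic wraps the last position on a ring back around to the first.
    """
    if not (0 <= row < rows and 0 <= col < cols):
        raise ValueError("element lies outside the torus")
    return {
        "row_ring": [(row, (col - 1) % cols), (row, (col + 1) % cols)],
        "col_ring": [((row - 1) % rows, col), ((row + 1) % rows, col)],
    }


neighbors = torus_neighbors(0, 0, rows=3, cols=4)
```

In a layout of this kind every element has links on two different rings, so losing one
ring still leaves the element reachable through the other.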

In this second preferred embodiment, each of the logically orthogonal communication rings
comprises an SCI network or an SCI-lite network, similar to the SCI network or SCI-lite
network described with reference to Figure 2A.

Operation Process Flow

Figure 3 shows a process flow diagram of operation of the file server system.

A method 300 is performed by the components of the file server system 100, and includes a
set of flow points and process steps as described herein.

At a flow point 310, a device coupled to the file server system 100 desires to make a file
system request.

At a step 311, the device transmits a file system request to a selected network switch 130
coupled to the file server system 100.

At a step 312, the network switch 130 receives the file system request. The network switch
130 determines which mass storage device the request applies to, and determines which file
server 110 is coupled to that mass storage device. The network switch 130 transmits the
request to that file server 110 (that is, to both of its nodes 111 in parallel), using the
interconnect 120.

At a step 313, the file server 110 receives the file system request. Each node 111 at the
file server 110 queues the request for processing.

At a step 314, one of the two nodes 111 at the file server 110 processes the file system
request and responds thereto. The other one of the two nodes 111 at the file server 110
discards the request without further processing.
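
Steps 313 and 314 above can be sketched as follows: both nodes of a pair queue the same
request, one node processes it, and the other discards it. This is a simplified
illustration, not the disclosed implementation. In particular, the `is_active` flag used
to decide which node processes the request is an assumption, since the text above does not
say how that choice is made; the class and function names are likewise hypothetical.

```python
from collections import deque


class Node:
    """A hypothetical stand-in for one node 111 of a file server 110."""

    def __init__(self, name, is_active):
        self.name = name
        self.is_active = is_active  # assumption: marks the node that processes requests
        self.queue = deque()

    def enqueue(self, request):
        # Step 313: each node queues the request for processing.
        self.queue.append(request)

    def drain(self):
        # Step 314: the active node processes and responds; the other node
        # removes the request from its queue without further processing.
        responses = []
        while self.queue:
            request = self.queue.popleft()
            if self.is_active:
                responses.append(f"{self.name} handled {request}")
        return responses


def deliver(pair, request):
    """Send one request to both nodes of a pair and collect the responses."""
    for node in pair:
        node.enqueue(request)
    responses = []
    for node in pair:
        responses.extend(node.drain())
    return responses


pair = (Node("node-a", is_active=True), Node("node-b", is_active=False))
responses = deliver(pair, "read block 7")
```

Exactly one response comes back for each request, and both queues are empty afterward.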

At a flow point 320, the file system request has been successfully processed.

If any single point of failure occurs between the requesting device and the mass storage
device to which the file system request applies, the file server system 100 is still able
to process the request and respond to the requesting device.

• If either one of the network switches 130 fails, the other network switch 130 is able to
  receive the file system request and transmit it to the appropriate file server 110.

• If any link in the interconnect 120 fails, the remaining links in the interconnect 120
  are able to transmit the message to the appropriate file server 110.

• If either node 111 at the file server 110 fails, the other node 111 is able to process
  the file system request using the appropriate mass storage device. Because nodes 111 at
  each file server 110 are coupled in pairs, each file server 110 is highly available.
  Because file servers 110 are coupled together for managing collections of mass storage
  devices, the entire system 100 is scalable by addition of file servers 110. Thus, each
  cluster of file servers 110 is scalable by addition of file servers 110.

• If any one of the mass storage devices (other than the actual target of the file system
  request) fails, there is no effect on the ability of the other mass storage devices to
  respond to processing of the request, and there is no effect on either of the two nodes
  111 which process requests for that mass storage device.
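
The switch and node failure cases listed above share one pattern: when one member of a
redundant pair fails, the surviving member is used instead. The sketch below illustrates
that pattern under simplifying assumptions. Component health is given as a plain
dictionary of booleans, and the names `first_working` and `plan_path` are hypothetical;
how failures are actually detected is not covered here.

```python
def first_working(components):
    """Return the name of the first component marked as working.

    `components` maps a component name to True (working) or False (failed).
    """
    for name, working in components.items():
        if working:
            return name
    raise RuntimeError("no working component remains")


def plan_path(switches, nodes):
    """Pick a network switch 130 and a node 111 that can still serve a request."""
    return first_working(switches), first_working(nodes)


# Single failure of the first switch: the second switch takes over.
path_after_switch_failure = plan_path(
    {"switch-1": False, "switch-2": True},
    {"node-a": True, "node-b": True},
)

# Single failure of one node: the other node of the pair services the request.
path_after_node_failure = plan_path(
    {"switch-1": True, "switch-2": True},
    {"node-a": False, "node-b": True},
)
```

Only when both members of the same pair fail does the sketch raise an error, which lies
outside the single-point-of-failure cases considered above.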

Alternative Embodiments 

Although preferred embodiments are disclosed herein, many variations are 
possible which remain within the concept, scope, and spirit of the invention, and these 
variations would become clear to those skilled in the art after perusal of this application. 